refactor(config): replace hydra/omegaconf with typed pydantic+tyro #21

Open
GrigoryEvko wants to merge 943 commits into FusionBrainLab:main from GrigoryEvko:feat/hydra-cutover

GrigoryEvko commented May 21, 2026

TL;DR

Replaces the Hydra/OmegaConf YAML configuration layer with typed Pydantic v2 schemas driven by a tyro CLI entry point. The runtime no longer parses YAML or resolves string interpolation; every experiment is a regular Python module that builds an ExperimentConfig and hands it to a renamed entry point at run.py.

Beyond the cutover itself, the audit surfaced and fixed 29 pre-existing bugs and latent defects — including a leaked API key in run-ID hashes, a race condition in the round-robin island selector, two O(N²) hot paths now 14–23× and 100–3200× faster, and a class of assert-based validators that silently corrupted output under python -O.

What this delivers

  • Typed config end-to-end. Pydantic v2 schemas in gigaevo/config/schemas/ cover every subsystem. Mistyped flags fail at parse time instead of mid-experiment.
  • Self-documenting CLI. python run.py <experiment> --help lists every overridable field with its description; Field(description=...) is present on every user-facing field.
  • Discoverable composition. Presets in gigaevo/config/{algorithm,engine,llm,pipeline,problem,runner}_presets.py compose into experiments via plain function calls — IDE jump-to-definition, autocomplete, and type-checker support all work.
  • Reproducibility artefact. Each run writes output_dir/{experiment_id}/config.json, where experiment_id = sha256(model_dump_json())[:12] — stable across reruns of the same config, distinct under any change.
  • Subprocess-isolated sweeps. gigaevo/sweep.py runs a parameter grid as N independent subprocesses, immune to GIL / global-state / CUDA-context leaks across runs.
  • Smaller dependency surface. Drops hydra-core + omegaconf; adds the lighter tyro. Faster import, fewer transitive deps.
  • Cleaner test isolation. No global resolver registry to leak between tests; no register_resolvers autouse needed.
  • No more MISSING sentinels. Pydantic validation refuses to construct a half-filled config, so subsystems never see partial state.
  • 9 reference experiments under experiments/ demonstrate the patterns: env-driven secrets via default_factory, discriminated-union variants selected by kind=..., preset composition, override syntax.

Performance wins

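Two long-standing O(N²) hot paths reworked, plus a smaller regex merge, each with explicit micro-benchmarks. The Pareto-front path now extracts fitness values once per program instead of twice per pair inside the pairwise loop; the statistics collector buckets programs by iteration once instead of rescanning the population for every program: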

| Hot path | Where | Speedup |
| --- | --- | --- |
| _compute_pareto_front | gigaevo/evolution/strategies/migrant_selectors.py | 14.9× at N=50, 22.9× at N=200, 23.1× at N=500 |
| EvolutionaryStatisticsCollector._process | gigaevo/programs/stages/collector.py | 114× at N=200×M=5, 570× at N=1000×M=10, 3205× at N=5000×M=50 |
| ChainFeatureExtractor.extract regex pass | gigaevo/evolution/scheduling/feature_extractor.py | 2× (single-pass merge of two re.finditer walks) |
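
The Pareto-front change can be pictured with a minimal sketch. It assumes programs are plain dicts with a hypothetical `metrics` mapping and that every objective is maximized; `extract_fitness_values` below is a stand-in with the same name as the repository helper, not its real signature.

```python
from typing import Sequence


def extract_fitness_values(program: dict, keys: Sequence[str]) -> tuple[float, ...]:
    # Hypothetical stand-in for the repository helper of the same name.
    return tuple(float(program["metrics"][k]) for k in keys)


def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    # a dominates b: no worse on every objective, strictly better on at least one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def compute_pareto_front(programs: list[dict], keys: Sequence[str]) -> list[dict]:
    # Extract each fitness vector exactly once, outside the pairwise loop.
    vectors = [extract_fitness_values(p, keys) for p in programs]
    front = []
    for i, vi in enumerate(vectors):
        if not any(dominates(vj, vi) for j, vj in enumerate(vectors) if j != i):
            front.append(programs[i])
    return front
```

In this sketch the dominance comparison is still pairwise; only the extraction cost is paid once per program rather than once per comparison, which is consistent with a speedup that levels off as N grows.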

Reliability — bugs and latent defects fixed

The cutover audit went deep across the whole repo. 29 distinct issues were found and fixed:

Security / credentials

  • ChatOpenAIConfig.api_key leaked into reproducibility artefacts. The key was being serialised into output_dir/<experiment_id>/config.json, and worse, into experiment_id itself (sha256 of model_dump_json()), so run IDs depended on credential rotation. Marking the field with Pydantic's exclude=True drops it from the dump, and therefore from the hash; the key now lives in memory only.
  • Real OpenRouter API key committed in problems/chains/musique/shared_config.py and problems/chains/musique_retrieval/shared_config.py. Replaced with os.environ.get("OPENROUTER_API_KEY", ""). (The leaked key remains in git history and should be rotated by its owner — third-party key, not anyone on this team.)
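
To make the credential fix concrete, here is a stdlib-only sketch of the idea: a secret that is read from the environment, kept out of the serialised dump, and therefore kept out of the run ID. It uses a dataclass with an `exclude` metadata flag as an analogue of the Pydantic field option; `ChatConfigSketch`, `dump_public`, and `experiment_id` are illustrative names, not the PR's code.

```python
import hashlib
import json
import os
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class ChatConfigSketch:
    model: str
    temperature: float = 0.7
    # Secret comes from the environment and is flagged as excluded from dumps.
    api_key: str = field(
        default_factory=lambda: os.environ.get("OPENROUTER_API_KEY", ""),
        metadata={"exclude": True},
        repr=False,
    )


def dump_public(cfg) -> str:
    # Serialise only non-excluded fields; sorted keys give a canonical form.
    public = {
        f.name: getattr(cfg, f.name)
        for f in fields(cfg)
        if not f.metadata.get("exclude")
    }
    return json.dumps(public, sort_keys=True)


def experiment_id(cfg) -> str:
    # First 12 hex characters of the SHA-256 of the public dump.
    return hashlib.sha256(dump_public(cfg).encode("utf-8")).hexdigest()[:12]
```

With this shape, rotating the key leaves the ID unchanged, while changing any public field produces a different ID.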

Correctness

  • RecordCardExtended.__init__ shadowed the dataclass init and never applied field(default_factory=...) defaults. Reading card.usage, card.keywords, card.evolution_statistics, card.works_with, or card.links raised AttributeError; dataclasses.asdict(card) exploded. change_motivation was mandatory in body but missing from required_fields, so import_idea_extended(is_forced=False) was dead on arrival. Lock-in tests added.
  • RoundRobinIslandSelector._idx race condition. Concurrent threads double-skipped or repeated islands. Added threading.Lock; 8-thread × 25-call uniform-histogram test asserts exact balance.
  • RidgePredictor.predict held the model lock across CPU-bound extract. Snapshot under lock, release, extract + predict on captured locals. Concurrency lock-in test added.
  • DagRunner.stop cancelled _metrics_collector_task without await → "Task was destroyed but it is pending" warnings + writer-ref retained past storage.close(). Now awaits with suppress(CancelledError).
  • DagRunner._launch fire-and-forget cancel on failed transition tasks → tasks lingered pending until GC. Routed through _cancel_task (which awaits with timeout).
  • cfg.problem.build() called twice in build_object_graph → double metrics.yaml reload per graph build. Threaded the already-built ProblemContext.
  • sweep._run_one aborted the entire pool on worker-spawn OSError (E2BIG/EMFILE/ENOMEM) — one bad spawn dropped every queued sibling run. Now logs and returns 1 to preserve "best-effort across all runs" semantics.
  • _dump_resolved_config orphaned .config.*.tmp files on write failure. Wrapped in try/finally with idempotent unlink.
  • chain_runner._run_chain_on_dataset_stepwise:429 was dropping the sample argument to _resolve_reference, so $sample.X references silently resolved to "". Latent landmine — the only consumer (musique_retrieval) routes through the non-stepwise variant today, but any future stepwise consumer would have been silently broken. Threaded dataset[i] through. 11 unit tests; reverted patch confirms the regression.
  • problems/prompts/utils.py:158 client.call_logs[0] dropped retry call logs and would IndexError on empty. New _aggregate_call_logs helper sums across all attempts; returns a zero CallLog on empty input. 3 unit tests.
  • remove_boxed in 3 problem helpers used bare assert s[:len(left)] == left and assert s[-1] == "}". Under regular Python they raised AssertionError on \boxed{42 (truncated) or \boxed{42}xyz (trailing garbage) and crashed the entire extract_answer loop in validate.py. Under python -O the assertions were stripped, yielding a corrupted slice. Replaced with explicit return None. 21 parametrized tests.
  • problems/prompts/ifbench/validate.py:37 mutated source DataFrame in place. to_dict(orient="records") shares list-cell references with the source DataFrame; the in-place rewrite corrupted subsequent iterations. Replaced with a local binding.
  • problems/prompts/jigsaw_community_rules/validate.py:46 returned None fitness on degenerate input; the downstream consumer in strategies/utils.py:79 does -value, which crashes on None. Returns 0.0 now.
  • 3 chain validators had -> dict annotations on (metrics, failures) tuple returns — annotation lied about the contract. Fixed hover/static, hotpotqa/static_ra, hotpotqa/static_a.
  • gigaevo/__init__.py had a dead pydantic.config.configure(compile="jit") call — that API has never existed in pydantic 2.x. The surrounding try/except Exception: pass silently swallowed AttributeError on every package import. Removed.
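
The two DagRunner fixes above share one pattern: cancel a task and then await it, so it has finished unwinding before anything it references is closed. A minimal sketch, assuming nothing about DagRunner's internals (`cancel_and_wait` and `demo` are hypothetical names):

```python
import asyncio
from contextlib import suppress


async def cancel_and_wait(task: asyncio.Task) -> None:
    # Request cancellation, then await so the task is fully finished
    # before the caller closes anything the task still references.
    task.cancel()
    with suppress(asyncio.CancelledError):
        await task


async def demo() -> bool:
    async def collect_metrics() -> None:
        while True:
            await asyncio.sleep(0.01)

    task = asyncio.create_task(collect_metrics())
    await asyncio.sleep(0)  # give the task a chance to start
    await cancel_and_wait(task)
    return task.cancelled()
```

Calling `asyncio.run(demo())` exercises the pattern end to end. One design caveat: suppressing `CancelledError` here also hides a cancellation aimed at the caller itself, which is one reason a production helper might add a timeout, as the PR's `_cancel_task` is described as doing.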

Latent bugs (would have bitten under specific conditions)

  • tools/lineage.py:226 sort key returned float | None → TypeError if any program had None fitness. Now substitutes -math.inf.
  • tools/lineage.py::_walk_lineage no cycle guard → infinite loop on corrupted parent chain (A→B→A or self-loop). Visited-set guard + 5 regression tests.
  • tools/redis2pd.py non-atomic df.to_csv → corrupt CSVs on concurrent runs or interrupts. Added _atomic_write_csv (tempfile + os.replace).
  • 3 Redis clients leaked in tools/{utils,fitness_vs_time,throughput_plot}.py — the throughput plotter scaled the leak with fan-out. Wrapped in try/finally.
  • Deprecated asyncio.get_event_loop() at 11 callsites across test_bandit.py, test_coevolution_pipeline.py, test_redis_storage.py, test_wrapper_enhanced.py. Bites whenever any prior event loop in the thread has been closed (exactly what pytest-asyncio does between tests); the implicit loop creation it relies on is deprecated, and Python 3.14 raises RuntimeError when no current event loop is set. Migrated to asyncio.run() / get_running_loop().
  • stage_timeout accepted on 6 builder schemas whose runtime constructor ignored it — silent user surprise. Moved to the two builders that actually consume it; lock-in tests reject the field on the others via extra="forbid".
  • DEFAULT_BINNING_TYPE: Final[str] mistyped against BinningType = Literal["linear"] → 5 invariance errors across algorithm_presets.py. Retyped.
  • LoggingConfig.build_writer annotated -> GenericLogger but returned CompositeLogger (siblings under LogWriter, not subclasses). The existing test already asserts CompositeLogger; annotation was the lie. Corrected.
  • experiments/prompt_coevolution.py main_redis_db=0 was a literal — overriding --redis.db N on tyro broke the coevolved-prompt fetcher (main wrote to DB N while the fetcher stayed at 0). Threaded redis.db through both sides.
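
Several items above (the orphaned .config.*.tmp files and the non-atomic CSV write) come down to the same temp-file-plus-rename pattern. The sketch below shows one way to write it with the standard library; `atomic_write_text` is a hypothetical helper, not the PR's `_atomic_write_csv` or `_dump_resolved_config`.

```python
import os
import tempfile
from contextlib import suppress
from pathlib import Path


def atomic_write_text(path: Path, text: str) -> None:
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # Atomic rename: readers see the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    finally:
        # Idempotent cleanup: after a successful replace the temp name is already gone.
        with suppress(FileNotFoundError):
            os.unlink(tmp_name)
```

The `finally` block is what prevents orphaned temp files when the write raises, and the unlink is safe to run on the success path as well.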

Why now

The Hydra layer was leaking OmegaConf semantics into the runtime: MISSING sentinels reached object construction, ${ref:X} resolution ran lazily and produced unhelpful tracebacks, the global resolver registry made test isolation awkward, and YAML interpolation was being asked to do work that wanted real Python expressions. A typed config model removes that whole class of problem.

What changes for users

Entry point rename

- python -m hydra_main +experiment=steady_state ...
+ python run.py experiments/steady_state.py [overrides ...]

YAML → Python experiment

A YAML experiment becomes a Python module exposing a single experiment() function that returns ExperimentConfig. The 9 reference experiments in experiments/ show the patterns.
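
How run.py loads an experiment file is not shown in this description, so the following is only a sketch of one plausible approach: import the module from its path and call its `experiment()` function. The helper names are hypothetical; the explicit `None` checks mirror the `spec_from_file_location` invariant mentioned in one of the commits.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def load_experiment_module(path: Path) -> ModuleType:
    spec = importlib.util.spec_from_file_location(path.stem, path)
    # spec_from_file_location may return None, and spec.loader is Optional.
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load experiment module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def build_config(path: Path):
    # Assumed contract: the module exposes a zero-argument experiment() factory.
    module = load_experiment_module(path)
    factory = getattr(module, "experiment", None)
    if not callable(factory):
        raise AttributeError(f"{path} does not define an experiment() function")
    return factory()
```

In the real entry point the returned object would be an ExperimentConfig; the sketch returns whatever the factory produces so that it stays independent of the schema package.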

Overrides

Use tyro's dotted-path syntax instead of Hydra's +key=value:

python run.py experiments/steady_state.py --redis.db 7 --engine.max-generations 50

python run.py experiments/<file>.py --help lists every overridable field with its description.

Sweeps

gigaevo/sweep.py runs a parameter grid as N independent subprocesses, isolating GIL / global-state issues. Each run is invoked exactly as a normal run.py invocation; sweep definitions are Python dicts.
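
As an illustration of the sweep model, the sketch below expands a dict-based grid into one run.py command per combination and runs each as its own subprocess, treating a spawn failure as a failed run rather than aborting the sweep. The function names and the grid shape are assumptions, not the API of gigaevo/sweep.py; the flag spelling follows the override example above.

```python
import itertools
import subprocess
import sys


def expand_grid(grid: dict[str, list]) -> list[dict]:
    # Cartesian product of all parameter values, one dict per run.
    keys = list(grid)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(grid[k] for k in keys))
    ]


def build_command(experiment: str, overrides: dict, python: str = sys.executable) -> list[str]:
    # Each run is an ordinary run.py invocation with dotted-path overrides.
    cmd = [python, "run.py", experiment]
    for flag, value in overrides.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


def run_one(cmd: list[str]) -> int:
    try:
        return subprocess.run(cmd, check=False).returncode
    except OSError as exc:
        # Spawn failure (for example E2BIG, EMFILE, ENOMEM): report it and
        # return a non-zero code so the remaining runs still execute.
        print(f"spawn failed for {cmd!r}: {exc}", file=sys.stderr)
        return 1
```

A driver would then map `run_one` over `build_command(...)` for every entry of `expand_grid(...)`, sequentially or through a worker pool.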

Schema surface

| Module | Covers |
| --- | --- |
| schemas/experiment.py | ExperimentConfig root, experiment_id hash, cross-field validators |
| schemas/algorithm.py | island topologies, MAP-Elites, single/multi-island, discriminated union |
| schemas/engine.py | steady-state, generational, bus-backed; BusedEngineConfig |
| schemas/pipeline.py | DAG builder variants (default, auto, context, optuna_opt, cma_opt, algotune_speed, structural_metrics, problem_specific) |
| schemas/llm.py | ChatOpenAIConfig / bandit / heterogeneous router discriminated union |
| schemas/redis.py + schemas/migration_bus.py | dataplane and migration-bus connection settings |
| schemas/problem.py, schemas/prompt.py, schemas/logging.py, schemas/scheduling.py, schemas/runner.py | remaining subsystem configs |

Field(description=...) is present on every user-facing field so --help is self-documenting.

Test plan

  • Full suite (excluding pre-existing tests/test_tools/test_manifest.py collection error unrelated to this branch) passes locally: ~5,900 tests green.
  • All 9 shipped experiments pass python run.py <experiment> --dry-run.
  • python run.py --help and python run.py <experiment> --help produce informative output (Field descriptions rendered inline).
  • Parity test harness compared YAML-driven vs schema-driven object graphs across the reference experiments before the YAML deletion landed; harness was removed once parity held.
  • Schema round-trip: cfg.model_dump_json() → ExperimentConfig.model_validate_json(...) is identity across the reference experiments.
  • Subprocess-isolated sweep regression test.
  • Concurrency lock-in test for the round-robin island selector (threading.Lock-guarded counter, 8-thread × 25-call uniform-histogram assertion).
  • Concurrency lock-in test for RidgePredictor.predict not holding the lock across extract.
  • Performance lock-in: _compute_pareto_front (14–23×) and EvolutionaryStatisticsCollector snapshot processing (114–3205×).
  • Regression tests for every fixed bug above (see commit history; each fix carries the test that catches its regression).

Conflict map with open PRs

This branch deletes the YAML tree and reshapes the config surface — that overlaps several open PRs at file-level only; the intent is orthogonal in every case:

| PR | Overlap | Resolution |
| --- | --- | --- |
| #2, #3, #4, #5, #6, #7, #8 | none (fix-only PRs in disjoint modules) | clean merge in either order |
| #10 (sanitize) | gigaevo/config/helpers.py was reshaped here; gigaevo/utils/text_sanitize.py is unchanged in this branch | take both sides; helpers.py reshape supersedes pre-cutover lines |
| #11 (xdist) | pytest.ini, pyproject.toml | take #11's pytest config + this branch's tyro dep |
| #12, #13 | LLM module only; no config-layer overlap | clean |
| #14 (loky-executor) | gigaevo/entrypoint/constants.py | take #14's constants reshape |
| #15 (error context) | gigaevo/runner/dag_runner.py unchanged here | clean |
| #16 (pipeline hygiene) | gigaevo/config/helpers.py, gigaevo/entrypoint/default_pipelines.py | both reshape the same modules; the second-merged PR rebases on the first |
| #17 (aiohttp) | pyproject.toml, gigaevo/llm/models.py, gigaevo/infra/*; config layer untouched | clean |
| #19 (asyncio-deprecation) | this branch independently migrated asyncio.get_event_loop() callers; #19 superset on main is preferred | rebase #19 onto post-cutover main; identical sites already on this branch can be dropped |
| #20 (dataplane-foundation) | deletes a different set of files (Redis substrate); doesn't touch gigaevo/config/schemas/* or experiments/* | clean at the config surface; engine wiring rebase needed |

No PR is blocked by this branch; merge order is reviewer's preference.

Out of scope / follow-ups

  • tests/test_tools/test_manifest.py references a tools.experiment.manifest module that has never existed in the repo (its production module was never committed); pre-existing collection error. Not addressed here.
  • A handful of # type: ignore[misc] comments on MagicMock.__class__ rebinding in tests — documented pattern, won't repay refactoring.
  • tools/status.py / tools/fitness_vs_time.py redis-py Awaitable type stubs are a known false-positive class; documented via typing.cast, no runtime effect.

KhrulkovV and others added 30 commits April 2, 2026 15:50
…rns AnyCard

Core change: normalize_memory_card returns MemoryCard | ProgramCard (Pydantic
models) instead of dict[str, Any]. All internal code uses attribute access
(card.description) not dict access (card.get("description")).

Production code changes:
- card_conversion.py: normalize_memory_card returns AnyCard; card_to_concept_content,
  build_entity_meta, format_search_results, is_program_card all accept AnyCard
- memory.py: self.memory_cards is dict[str, AnyCard]; _persist_index serializes
  via model_dump(); _synthesize_results uses model attribute access; save_card
  accepts dict | AnyCard at boundary
- memory_write_example.py: load_memory_cards normalizes all output to AnyCard
- models.py: ProgramCard gains keywords, strategy, links fields;
  validate_assignment=True for mutability

Boundary pattern:
- External input (JSON, API responses, user dicts) → normalize_memory_card → AnyCard
- Internal operations → attribute access (card.field)
- Serialization (JSON, API) → card.model_dump() at the boundary
- card_update_dedup.py stays dict-based (LLM output parsing) — callers pass
  model_dump() when crossing the boundary

813 tests pass, ruff check + format clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: ideas_tracker cleanup — loguru + sys.path removal
refactor: dict → Pydantic — normalize_memory_card returns AnyCard
Automatically generated by python-semantic-release
Create gigaevo/memory/__init__.py with curated public API:
- AmemGamMemory, MemoryCard, ProgramCard, AnyCard, ConnectedIdea
- normalize_memory_card, GigaEvoMemoryBase
- LocalMemorySnapshot, MemoryCardExplanation, Strategy

Update gigaevo/memory/shared_memory/__init__.py with same exports.

Users can now import from `gigaevo.memory` instead of deep paths:
  from gigaevo.memory import AmemGamMemory, MemoryCard

5 tests verify: __all__ completeness, package imports, subpackage imports,
normalize roundtrip, AmemGamMemory construction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Test gap: add LocalMemorySnapshot and Strategy to test_import_from_package_root
  (previously 2/10 exports untested for importability)
- Circular import fragility: change `from gigaevo.memory import config` to
  `import gigaevo.memory.config as config` in 3 files — avoids relying on
  partial parent-package init during import chain

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move all memory-related test files from tests/ root into tests/memory/
subdirectory. Zero test loss: 483 tests before, 483 tests after.

Files moved:
- test_amem_gam_memory.py (67 tests)
- test_card_update_dedup_extended.py (75 tests)
- test_memory_api_search.py (21 tests)
- test_memory_card_update_dedup.py (6 tests)
- test_memory_contracts.py (21 tests)
- test_memory_cycle5.py (17 tests)
- test_memory_deeper.py (21 tests)
- test_memory_e2e_scenarios.py (21 tests)
- test_memory_engine_interaction.py (10 tests)
- test_memory_full_agentic.py (15 tests)
- test_memory_integration.py (26 tests)
- test_memory_known_bugs.py (20 tests)
- test_memory_models.py (16 tests)
- test_memory_operator_integration.py (14 tests)
- test_memory_public_api.py (5 tests)
- test_memory_with_fake_agentic.py (24 tests)
- test_memory_write_example_extended.py (22 tests)
- test_memory_write_program_cards.py (3 tests)
- test_normalize_memory_card.py (66 tests)
- test_pydantic_cards.py (13 tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

refactor(memory): consolidate test files into tests/memory/
Add `from __future__ import annotations` to all 41 memory module files
that were missing it. Remove duplicate `_safe_get` from
a_mem_memory_creation.py (now imports from utils.py). Auto-fix 4
UP037 violations (unnecessary quoted type annotations now that future
annotations are active).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Was `-> dict[str, Any]` but actually returns `AnyCard` (Pydantic model).
Found by chaos-hacker review — prevents TypeError trap for future callers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): type quality improvements
Replaced all print() calls with loguru logger in 6 files:
- A_mem/agentic_memory/memory_system.py: 4 prints → logger
- A_mem/agent/agent_class.py: 2 prints → logger
- GAM_root/gam/agents/research_agent.py: 38 prints → logger
- GAM_root/gam/retriever/index_retriever.py: 2 prints → logger
- GAM_root/gam/schemas/page.py: 2 prints → logger
- GAM_root/gam/schemas/memory.py: 2 prints → logger

Also removed old-style `logger = logging.getLogger(__name__)` in
memory_system.py (replaced by loguru import).

509 tests pass, ruff clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: replace 50 print() with loguru in A_mem + GAM_root
Automatically generated by python-semantic-release
exp: hover/no-deep-retrieval — ablation of retrieve_deep (k=10)
Rename map:
- test_normalize_memory_card → test_card_normalization
- test_memory_card_update_dedup → test_card_dedup
- test_card_update_dedup_extended → test_card_dedup_edge_cases
- test_amem_gam_memory → test_memory_backend
- test_memory_deeper → test_memory_backend_internal
- test_memory_full_agentic → test_memory_backend_agentic
- test_memory_with_fake_agentic → test_memory_backend_fakes
- test_memory_cycle5 → test_api_sync
- test_memory_operator_integration → test_mutation_operator
- test_memory_engine_interaction → test_engine_integration
- test_memory_write_example_extended → test_write_pipeline
- test_memory_write_program_cards → test_write_programs
- test_memory_e2e_scenarios → test_scenarios
- test_memory_integration → test_roundtrip
- test_memory_known_bugs → test_edge_cases
- test_concept_api_client → test_api_client (moved to tests/memory/)
- test_openai_inference → test_llm_inference (moved to tests/memory/)
- test_data_components, test_runtime_config (moved to tests/memory/)

666 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: rename memory test files — descriptive names
Every suppression replaced with proper typing:
- Stage base: ClassVar[type[StageIO]] instead of unbound TypeVars
- json.py: single dumps/loads definitions with types.ModuleType backend
- LLM agents: TypedDict fields widened to accept None (truthful initial state)
- Redis coevolution: _get_redis() returns AsyncRedis instead of object
- DAG/engine/trackers: invariant assertions replacing silent suppression
- analyzer.py: fixed wrong return type (dict → IncomingIdeas)

4468 tests pass, lint clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor: remove all 27 type: ignore comments
…p Dynamic Chains

New problem variant: chains/hover/full7_no_deep (7-step max, standard retrieval only).
Design approved by Reviewer-2. Two-phase protocol: Phase A builds memory bank,
Phase B tests memory-augmented vs standard mutation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tr])

RecordCardExtended.aliases is list[dict[str, dict[str, str|list[str]]]]
but MemoryCard.aliases expects list[str]. The _to_list() helper passed
dicts through unchanged, causing Pydantic validation crash at the
memory write pipeline step after ideas_tracker completes.

Added _flatten_aliases() that extracts description strings from the
nested dict format while preserving plain string aliases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RecordCardExtended.aliases is list[dict] (version history with nested
{experiment_id: {description, programs, explanations}}). The Pydantic
migration (846299e) incorrectly typed MemoryCard.aliases as list[str],
causing a validation crash when memory_write_pipeline passes ideas_tracker
output through normalize_memory_card.

Root cause: Pydantic migration assumed aliases are simple strings, but
Petr's original design uses them as structured version history. Fix the
type at the model level instead of adding a flattening adapter.

Reverts the _flatten_aliases band-aid from the previous commit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nLab#2, PR #161)

Added tests that verify normalize_memory_card and load_memory_cards handle
ideas_tracker's structured alias format (list[dict] version history) without
crashing. This test would have caught the Pydantic type mismatch that crashed
the memory write pipeline (aliases: list[str] → list[dict]).

Tests added:
- test_aliases_with_ideas_tracker_dict_format (test_normalize_memory_card.py)
- test_aliases_mixed_types (test_normalize_memory_card.py)
- test_ideas_tracker_dict_aliases_preserved (test_memory_write_example_extended.py)

All 91 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Redis WATCH deprecation: use pipeline context in archive_storage.py
- AsyncMock coroutine warnings: set storage.snapshot = MagicMock() (bump() is sync)
- ast.Str deprecation: use ast.Constant only (Python 3.14 compat)
- Optuna ExperimentalWarning: suppress around TPESampler/PedAnovaImportanceEvaluator
- Unclosed file handles: pathlib.Path.read_text() in test_scheduling.py
- matplotlib tight_layout: layout="tight" on subplots() in comparison.py
- Island __len__ RuntimeWarning: suppress in intentional error test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Automatically generated by python-semantic-release
The previous literal token was a real, live OpenRouter credential committed
to the source tree. Switch both musique and musique_retrieval shared
configs to read OPENROUTER_API_KEY from the environment so the value is no
longer redistributed with the repo. The committed key must still be
revoked and rotated.
…cstrings

Rewrites the lineage-race and ingestion-atomicity class docstrings to
describe the present-tense contract being exercised rather than narrating
prior root-cause investigations or pinning to brittle source-file
locations.
pydantic 2.x has no module-level ``configure`` API; the call has been
raising ``AttributeError`` since the rename and the surrounding
try/except swallowed it on every import. Removes the block, the import,
and the silenced exception path.
… arg

The Top-N path called list.sort() with a key returning float | None which
raises TypeError if any program has None fitness. The previous filter
suppressed that in practice, but a stray None would crash mid-sort. The
fallback now substitutes -inf so None values sink to the end deterministically.

_walk_lineage accepted a metric argument it never consulted; the chain
walk depends only on parent edges. Drop it from the signature and call
sites (including the regression tests).
…nce check

_compute_pareto_front extracted fitness values inside the inner O(N^2)
loop, re-running extract_fitness_values twice per pair. Pre-extract once
per program then iterate over the cached vectors, matching the pattern
already used by ParetoFrontArchiveRemover.order_candidates.

Micro-benchmark with 3 fitness keys (N candidates):
  N=50:  9.53 ms -> 0.64 ms (14.9x)
  N=200: 109.32 ms -> 4.77 ms (22.9x)
  N=500: 365.29 ms -> 15.80 ms (23.1x)

tests/evolution/test_migrant_selectors.py: 10 passed.
spec_from_file_location returns ModuleSpec | None and its loader is
Optional. The previous direct .loader.exec_module() chain would raise
AttributeError instead of a useful message if either was None. The
assert documents the invariant and gives pyright the narrowing it
needs.
…h threading.Lock

The round-robin index was advanced under no synchronization. Concurrent
callers from multiple OS threads could read the same value, write the
same successor, and either repeat or skip an island. Wrap the RMW in
a threading.Lock and pin the local index for the return.

Covered by a new test that drives select_island from eight threads,
each with its own event loop, and asserts a perfectly uniform island
histogram across 200 advances.
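
A simplified, synchronous sketch of the locked read-modify-write and of the uniform-histogram check described above. The class and helper names are illustrative, and the real selector is driven from per-thread event loops rather than called directly.

```python
import threading
from collections import Counter


class RoundRobinSelector:
    def __init__(self, islands: list[str]) -> None:
        self._islands = list(islands)
        self._idx = 0
        self._lock = threading.Lock()

    def select(self) -> str:
        # Read-modify-write under the lock; pin the index locally for the return.
        with self._lock:
            idx = self._idx
            self._idx = (idx + 1) % len(self._islands)
        return self._islands[idx]


def histogram(selector: RoundRobinSelector, threads: int = 8, calls: int = 25) -> Counter:
    picks: list[str] = []
    picks_lock = threading.Lock()

    def worker() -> None:
        local = [selector.select() for _ in range(calls)]
        with picks_lock:
            picks.extend(local)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return Counter(picks)
```

With four islands and 8 × 25 = 200 advances, every island should be selected exactly 50 times; any skipped or repeated index would break that balance.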
…action

RidgePredictor.predict previously held the model lock across the
extractor.extract call and the sklearn predict call. Extraction is a
pure, potentially expensive operation that does not touch any of the
predictor's mutable state, so serializing concurrent predictions through
it is wasted contention.

Snapshot the (model, feature_keys) pair under the lock, then release it
before extracting features and invoking predict on the captured local
references. The sklearn model is immutable after fit, so the captured
reference remains valid for the duration of the call. The no-model
fallback behaviour is preserved exactly.

A new probe test acquires the predictor lock non-blocking from inside a
custom extractor and asserts it is free on every concurrent call, so a
regression that re-introduces lock-held extraction would surface as a
False in the recorded lock-state list.
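
The snapshot-then-release pattern can be sketched as follows. `PredictorSketch` is a hypothetical stand-in: the "model" is any callable rather than a fitted sklearn estimator, and the extractor is a plain function returning a feature dict.

```python
import threading


class PredictorSketch:
    def __init__(self, extractor) -> None:
        self._extractor = extractor
        self._lock = threading.Lock()
        self._model = None  # set by fit(); treated as immutable afterwards
        self._feature_keys: tuple = ()

    def fit(self, model, feature_keys) -> None:
        with self._lock:
            self._model, self._feature_keys = model, tuple(feature_keys)

    def predict(self, program, default: float = 0.0) -> float:
        # Hold the lock only long enough to capture the current references.
        with self._lock:
            model, keys = self._model, self._feature_keys
        if model is None:
            return default  # no-model fallback
        features = self._extractor(program)  # runs without the lock held
        return model([features[k] for k in keys])
```

A probe in the spirit of the test described above would try a non-blocking acquire of `_lock` from inside the extractor and record whether it succeeded.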
EvolutionaryStatisticsCollector._process re-filtered the full population
by iteration metadata for every program in the snapshot, repeating the
O(N) scan N times. Bucket programs by iteration once in
_ensure_population_cache (alongside the existing per-generation cache)
then look up the iteration entry by key. Skipping programs whose
iteration metadata is absent preserves the existing None-iteration
fallback when the snapshot excludes metadata.

Micro-benchmark of the filter pattern (M iterations across N programs):
  N=200 M=5:    2.99 ms -> 0.03 ms (~114x)
  N=1000 M=10:  111.59 ms -> 0.20 ms (~570x)
  N=5000 M=50:  8410.70 ms -> 2.62 ms (~3205x)

tests/stages/test_collector.py: 29 passed.
tests/benchmarks/test_collector_scaling.py: 12 passed.
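
A minimal sketch of the bucketing idea, assuming programs are dicts with an optional `iteration` key (the real collector works on program objects and their metadata): build the iteration index once, then replace each per-program scan with a dictionary lookup.

```python
from collections import defaultdict


def bucket_by_iteration(programs: list[dict]) -> dict[int, list[dict]]:
    # One pass over the population; later lookups are O(1) per program.
    buckets: dict[int, list[dict]] = defaultdict(list)
    for program in programs:
        iteration = program.get("iteration")
        if iteration is None:
            continue  # programs without iteration metadata are skipped
        buckets[iteration].append(program)
    return dict(buckets)


def peers_naive(programs: list[dict], program: dict) -> list[dict]:
    # The pattern being replaced: a full scan for every program.
    return [p for p in programs if p.get("iteration") == program.get("iteration")]
```

Both functions preserve population order within an iteration, so the bucketed lookup returns the same list the naive filter would.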
The stepwise tool-step path passed (ref, outer_context, step_outputs)
to _resolve_reference but omitted the per-sample dict, so $sample.X
references silently resolved to the empty string. Latent today because
no enabled stepwise consumer depends on $sample.* yet, but it is a
correctness landmine for future tool inputs that need sample fields.

Add a regression test covering the stepwise dispatch path plus the
existing reference-resolution branches.
Every public field on the typed config schemas gains a one-sentence
description. The CLI's --help layer (tyro) reads these and renders them
next to each flag, so end users see what every override does instead of
just the default value.

Covers algorithm, engine, experiment, llm, logging, migration_bus,
pipeline, problem, prompt, redis, runner, and scheduling. Internal
fields kept under a clear class-level docstring (the discriminated-union
markers and structural list fields whose semantics are explained in the
class header) are left alone.
_process_sample read client.call_logs[0], which both IndexErrors when
no log was appended and silently drops every retry attempt beyond the
first. The retry decorator on LLMClient.__call__ can push multiple
entries (each successful API hit appends one) before the call that
yielded the returned response, so the existing read understated the
sample's budget consumption.

Introduce a private aggregator that sums prompt_tokens, completion_tokens,
cost, and cost_utilization across all per-attempt entries, and falls
back to a zero CallLog on the empty-list branch. The fix is contained
to utils.py and does not touch the fenced client.
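
The aggregation could look roughly like the sketch below. The `CallLog` here is a simplified stand-in with three of the fields named above (cost_utilization is left out because its semantics are not described), and `aggregate_call_logs` is an illustrative name for the private helper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CallLog:
    # Simplified stand-in for the client's per-attempt log entry.
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost: float = 0.0


def aggregate_call_logs(logs: Iterable[CallLog]) -> CallLog:
    # Sum every attempt, including retries; an empty input yields a zero CallLog.
    total = CallLog()
    for log in logs:
        total.prompt_tokens += log.prompt_tokens
        total.completion_tokens += log.completion_tokens
        total.cost += log.cost
    return total
```

Compared with reading `call_logs[0]`, this counts retry attempts toward the sample's budget and cannot raise `IndexError` on an empty list.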
remove_boxed previously used bare ``assert`` statements to enforce
boxed-expression shape: ``\boxed{42xyz`` (trailing garbage) and
``\boxed{42`` (missing closing brace) both raised AssertionError, and
under ``python -O`` the assertions are stripped — turning structural
checks into silent fall-through that corrupts the returned slice.

Replace the asserts with explicit ``return None`` guards, matching the
existing "no boxed found" branch. Well-formed input keeps producing the
same string; malformed input now folds into the standard extraction-
failure path that callers already handle by counting None predictions.

Applied to all three sibling copies (chains/aime, prompts/aime,
prompts/gsm8k) and covered by a parametrized regression suite.
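
A sketch of the guard-based version, assuming the helper only handles the `\boxed{...}` prefix form checked by the asserts quoted in the PR description; the repository copies may cover more cases.

```python
from typing import Optional


def remove_boxed(s: str) -> Optional[str]:
    left = "\\boxed{"
    # Explicit guards instead of asserts: behaviour is identical under python -O.
    if not s.startswith(left) or not s.endswith("}"):
        return None
    return s[len(left):-1]
```

Malformed input now returns `None`, the same value callers already treat as an extraction failure.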
Each experiment module now states the problem, the algorithm /
pipeline / engine / LLM choice it showcases, and any unusual
constraint in 2-4 lines so a user scanning the experiments/
directory can pick the right starting point without reading the body.

``runner_presets`` gains the same compose-into-experiment example
the other ``*_presets`` modules already carry so the surface is
uniform across the preset layer.
The TYPE_CHECKING guard contained only a 'pass' placeholder. Remove it
along with the unused TYPE_CHECKING import.
Fix B007 in tools/throughput_plot.py and tools/wizard/__main__.py where
the loop control variable is discarded inside the body.
PIE790: each exception class already has a docstring, which satisfies
the suite's body requirement on its own.
Three chain validators return `(metrics, failures)` tuples but advertise
`-> dict`. The runtime contract in `CallValidatorFunction.parse_output`
already accepts both shapes, so behaviour is unchanged — this is a pure
annotation/docstring repair so type-checkers and readers see the actual
return type.

Touched: chains/hover/static, chains/hotpotqa/static_ra, chains/hotpotqa/static_a.
The per-instruction loop wrote the None-stripped kwargs dict back into
`input["kwargs"][index]`. Because `DataFrame.to_dict(orient="records")`
shares the underlying list cells with the source frame, that write
poisoned the dataset for any subsequent validate() call that reused
the cached frame. Filter into a local dict instead; the dataset stays
pristine across iterations.
…scorable

`calculate_fitness` returned `None` when no rule had multi-class
coverage. The selectors call `extract_fitness_values`, which negates
`value` for minimization objectives — a `None` propagates as a
`TypeError` on `-None`. Substitute `0.0` so degenerate batches surface
as "no signal" rather than crashing the engine, and annotate the
function with `-> float` to document the contract.
`tyro.cli(..., args=["--help"])` always raises `SystemExit(0)` via
argparse, so the trailing `return 0` could never execute. Remove the
dead line and document the exit semantics inline so future readers
don't reintroduce the assumption that control falls through.
- redis/metrics._flatten_numbers: the ternary on key construction had
  identical 'then' and 'else' expressions; collapse to a single literal.
- tools/lineage: tools/**/*.py already ignores E402 globally, so the
  per-import noqa: E402 directives are redundant.
RUF059: serve_until_signal discards the 'done' set returned by
asyncio.wait, and trajectory only reads prev_v from the trailing
improvement_points tuple. Prefix with underscore so the intent is
visible at the unpack site.
redis-py stubs share signatures between the sync and async clients, so
return types widen to Union[Awaitable[T], T]. The sync client always
returns the concrete value, but pyright narrows on the Awaitable side
and flags every lrange/hgetall/keys site. Add typing.cast narrowings
where the call sites are; behaviour at runtime is unchanged.
The smoothed array is built across five branches; one path yields a
pandas Series.values whose dtype the numpy stubs cannot align with the
boolean-indexed __setitem__ signature. asarray pins the runtime type
without changing the produced values.
…arnings

MagicMock spoofs isinstance checks by rebinding __class__; the type
checker rejects the assignment, but the runtime pattern is documented
behaviour. Annotate the two assignments with the standard misc ignore.
… literal

The constant was annotated Final[str], so preset builders passed
list[str] into BehaviorSpaceConfig.binning_types whose declared
list[BinningType] is invariant. Re-typing the constant against the
schema literal lets the presets typecheck without runtime change. The
import is aliased with an underscore prefix so the defaults namespace
stays free of foreign symbols.
init_composite returns CompositeLogger, which is a sibling of
GenericLogger under LogWriter rather than a subclass. The previous
GenericLogger return annotation misrepresented the concrete return and
broke type narrowing on every caller; the unit test already asserts
isinstance(writer, CompositeLogger).

2 participants